Abstract

We revisit the classical algorithms for searching over sorted sets to introduce an algorithm
refinement, called Adaptive Search, that combines the good features of Interpolation search and
those of Binary search. With respect to Interpolation search, only a constant number of extra
comparisons is introduced. Yet, under diverse input data distributions our algorithm shows costs
comparable to that of Interpolation search, i.e., $O(\log\log n)$, while the worst-case cost is
always in $O(\log n)$, as with Binary search. On benchmarks drawn from large datasets, both
synthetic and real-life, Adaptive search achieves better times and fewer memory accesses even
than Santoro and Sidney's Interpolation-Binary search.

Keywords: Sorting, Searching sorted sets 


1. Introduction 

We revisit the classical algorithms for searching over sorted sets to introduce a new algorithm,
called Adaptive search (AS), that combines the good features of Interpolation search and those
of Binary search [1].

The membership problem can be formally defined as follows. 

instance:

• $S = \{a_1, a_2, \ldots, a_n\}$, a set of n distinct, sorted elements,
with $a_i < a_{i+1}$, $1 \le i \le n - 1$;

• an element key

question: Does key belong to the set represented by S ($key \in S$)?

There exist two classical algorithms for searching over sorted sets: Binary search (BS) [1]
and Interpolation search (IS) [2]; both take advantage of the ordering of the instance to minimize
the number of keys that must be accessed.

In BS, the worst-case computational cost is $O(\log n)$; this result is independent of the data
distribution over the instance. Notice that in search the worst case is rather important, as it
corresponds to an unsuccessful membership query.

Vice versa, the Interpolation search algorithm is more efficient than BS when the elements
of S are distributed uniformly or quasi-uniformly¹ over the $[a_1, a_n]$ interval; the computational
cost is in $O(\log\log n)$.

Unfortunately, Interpolation search degrades to $O(n)$ when data is not uniformly distributed
(in the sense above). This is particularly inconvenient when searching over indexes of large
databases, where it is crucial to minimize the number of accesses².
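
The degradation can be observed on a small scale with the following Python sketch. It is not the authors' Java code: it is a textbook interpolation search that counts only the interpolated positions it probes, run once on evenly spaced keys and once on keys whose gaps double at every step. The function name `interpolation_search_probes` and both test arrays are assumptions made for illustration.

```python
def interpolation_search_probes(a, key):
    """Search the sorted list a (distinct integers) for key.

    Returns (index or -1, number of interpolated positions probed).
    """
    bot, top = 0, len(a) - 1
    probes = 0
    while bot <= top and a[bot] <= key <= a[top]:
        if a[top] == a[bot]:
            nxt = bot  # one value left: avoid dividing by zero
        else:
            # Integer form of the interpolation formula.
            nxt = bot + (key - a[bot]) * (top - bot) // (a[top] - a[bot])
        probes += 1
        if a[nxt] == key:
            return nxt, probes
        if a[nxt] < key:
            bot = nxt + 1
        else:
            top = nxt - 1
    return -1, probes


uniform = list(range(0, 3000, 3))       # 1000 keys, constant gap
skewed = [2 ** i for i in range(1000)]  # 1000 keys, gaps doubling

worst_uniform = max(interpolation_search_probes(uniform, k)[1] for k in uniform)
worst_skewed = max(interpolation_search_probes(skewed, k)[1] for k in skewed)
```

On these two inputs every evenly spaced key is found with a single probe, whereas some keys of the skewed array need several hundred probes, because the interpolated position barely moves away from `bot`.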

In this work we propose an algorithm, called Adaptive search (AS), that refines Interpolation
search and minimizes the number of memory accesses needed to complete a search. AS adapts
to the values by means of a mixed behavior: it combines the independence from the
distribution of BS with the good average costs of IS.

With respect to Interpolation search, AS requires only a constant number of extra comparisons. Yet,
under several relevant input data distributions our algorithm shows average-case costs comparable
to those of interpolation, i.e., $O(\log\log n)$, while the worst-case cost remains in $O(\log n)$,
as with Binary search.

Comparison with more recent literature is also encouraging: on both synthetic and real
datasets AS has better times and fewer memory accesses than Santoro and Sidney's
Interpolation-Binary search [4]. Also, it is easier to implement and more broadly applicable than
the approach of Demaine et al. [5] to searching non-independent data.

2. The Adaptive Search algorithm 

Given an ordered set S, allocated on an array A, and an element key that is searched, we
define the following:

A[bot]: the minimum element of the subset (at the beginning, bot = 1);

A[top]: the maximum element of the subset (at the beginning, top = |S|);

A[next]: the interpolation element, i.e., what IS would choose, and

A[med]: the element halfway between bot and top, i.e., what BS would choose.

Our algorithm consists, essentially, of a while cycle. At each iteration, we consider
$S = \{A[bot], \ldots, A[top]\}$ and we set:

$$next = bot + \left\lfloor \frac{key - A[bot]}{A[top] - A[bot]} \cdot (top - bot) \right\rfloor$$


The variable next defined above contains the index value that bounds the array segment on which
our AS algorithm will recur. As with interpolation, the instance is now clipped:

$$S' = \begin{cases} \{A[bot], \ldots, A[next]\} & \text{if } A[bot] \le key \le A[next] \\ \{A[next], \ldots, A[top]\} & \text{otherwise} \end{cases}$$

¹By quasi-uniform data distribution we mean, informally, that the distance between two consecutive values of S
does not vary much.

²In this discussion we do not consider the advanced techniques, viz. the exploitation of locality, that underlie search
over large database indexes.

To do so, we set the new boundaries of the segment containing S':

$$\begin{cases} top = next & \text{if } A[bot] \le key \le A[next] \\ bot = next & \text{otherwise} \end{cases}$$

The computation is now restricted to the segment that would have been considered by IS. Next,
the median point is computed over this restricted segment, rather than over the whole input. Vice
versa, if interpolation returns a shorter interval than BS would have, we keep the result of the
interpolation step:

if $|S'| > \frac{|S|}{2}$ then $next = med = bot + \left\lfloor \frac{top - bot}{2} \right\rfloor$

elseif $key = A[next]$ then key is found and we terminate;
elseif $key > A[next]$ then $bot = next + 1$;

else $top = next - 1$ (must be $key < A[next]$).

At the end of the iteration, $S' = \{A[bot], \ldots, A[top]\}$ and, clearly, $|S'| \le \frac{|S|}{2}$. Finally:

if $A[bot] \le key \le A[top]$ then iterate search on S',

else $key \notin S$ and we terminate with no.
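
The iteration described above can be sketched in Python as follows. This is one possible reading of the steps, not the authors' Java implementation: it assumes 0-based indexes, integer keys and integer division for the interpolation formula, and it applies the three-way comparison with A[next] after next has been fixed (either by interpolation or by the midpoint). The name `adaptive_search` is chosen here for illustration.

```python
def adaptive_search(a, key):
    """Return the index of key in the sorted list a (distinct integers), or -1."""
    bot, top = 0, len(a) - 1
    while bot <= top and a[bot] <= key <= a[top]:
        if a[bot] == a[top]:
            return bot  # one value left, and the loop test says it equals key
        size = top - bot + 1
        # Interpolation step: the position IS would probe.
        nxt = bot + (key - a[bot]) * (top - bot) // (a[top] - a[bot])
        # Clip the segment to the side of nxt that still contains key.
        if key <= a[nxt]:
            top = nxt
        else:
            bot = nxt
        # If the clipped segment is still longer than half of the old one,
        # probe its midpoint instead, as BS would.
        if 2 * (top - bot + 1) > size:
            nxt = bot + (top - bot) // 2
        if a[nxt] == key:
            return nxt
        if key > a[nxt]:
            bot = nxt + 1
        else:
            top = nxt - 1
    return -1
```

For example, `adaptive_search([1, 2, 4, 8, 16], 8)` returns 3 after two iterations: the first one falls back to the midpoint, the second one keeps the interpolated position.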

From the point of view of computational costs, we can summarize as follows: at each iteration our
algorithm may spend up to twice as many operations as IS in carefully finding the best halving of
the search segment, which in turn means that fewer iterations are needed to complete. By
means of standard cost analysis techniques, we have the following results:

• Best case: key is found with a constant number of comparisons: $O(1)$;

• Worst case: the intervals between values are unevenly distributed; hence, the interval found
by the BS technique is always the shortest. As a result, AS will execute essentially the same
search as BS, with equal $O(\log n)$ time complexity (but more operations at each level), and

• Average case: we consider the average case to be when the distance between two consecutive
values of S varies according to the normal (Gaussian) probability distribution. In such
cases, AS executes exactly as IS, so its cost is in $O(\log\log n)$.

To see why this is the case, consider the probability that a given input (an ordered set in
array A, and a value key to be searched) elicits the IS case under very mild assumptions about
the distribution of the ordered values in A. For any given instance, we fix $N = top - bot$
as the number of gaps between two consecutive values of the ordered set S. Also, we fix
$R = A[top] - A[bot] + 1$ as the numeric range over which the input values appear³. It is
reasonable to associate the input value key to a random numerical variable $X_{key}$ distributed
uniformly over the values interval $A[bot] \ldots A[top]$.

³Whenever key is out of such range, i.e., key < A[bot] or key > A[top], we exit with 'no' in constant time, and so do
other algorithms.

Now, consider N distinct random variables $X_1, \ldots, X_N$ representing the width of each gap
between two consecutive values of S: $X_i = A[i + 1] - A[i]$. We associate each random
variable $X_i$ with a normal probability distribution centered over $R/N$, given that
$\sum_{i=1}^{N} X_i = R$. Hence, the expected value of each $X_i$ is $R/N$. Moreover, thanks to
the property of the normal distribution by which the sum of normal distributions is a normal
distribution itself, we
N 

can obtain that PrlYjjLi Xi < |] < 5 . We can then conclude that the average case, captured 
by the normal distribution of the gaps, safely falls in the IS case.

3. Relation with literature 

Only after our solution was conceived and implemented did we become aware of an earlier
work by Santoro and Sidney, who devised a similar solution that combines (but does not blend)
interpolation and binary search. Although the asymptotic complexity is the same, there
are some marked differences between their solution and ours, which we discuss now.

Santoro and Sidney's algorithm, called Interpolation-Binary search (IBS), is based on the idea that
interpolation search is useful, from the point of view of costs, only when the array searched
is larger than a given threshold. When the considered array segment is smaller than a
user-defined threshold, binary search is applied unconditionally. Vice versa, above the threshold
an interpolation search step is applied, possibly followed by a binary search step.

Unlike IBS, our algorithm makes, at each level of its iteration, a choice about which clipping 
of S to apply. Hence, it is possible to show that for any input AS will not take more elementary 
operations than IBS. 

We have sought a statistical confirmation of this fact by running a set of experiments over
randomly generated ordered sets; the results are presented in detail in the next section⁴. We
limited the testing of IBS to queries with its parameter set to 2 (IBS2), which the authors
suggested would work best. For all parameter settings and for all data distributions considered,
AS outperformed IBS, although the difference was sometimes statistically insignificant.

Two other works that address search over sorted sets have considered slight variations of
the specification: that of Mehlhorn and Tsakalidis [3] and that of Demaine et al. [5]. The
former considered an extended data structure, the Interpolation Search Tree (IST), to optimize
the dictionary operations, not just search, over the sorted set. As such, their solution is not
comparable to ours, as it seeks to optimize insertion and deletion times rather than speed up
search.

The latter, i.e., Demaine et al.'s interpolation search for non-independent data, is also not
directly comparable to our work, but deserves a careful analysis. They define a deterministic
metric of "well-behaved" or smooth data that enables searching along the lines of interpolation
search. Specifically, they define

$$\Delta = \frac{\max_i (x_i - x_{i-1})}{\min_i (x_i - x_{i-1})}$$

i.e., the ratio between the largest and smallest gap between two adjacent elements of S, as the key
parameter in measuring the well-behavedness of the input. A data structure is needed that
maintains a dynamic dataset and evenly divides the interval $(x_1, x_n)$ into n bins, named
$B_1, \ldots, B_n$, each of which represents a range of size $\frac{x_n - x_1}{n}$.

⁴The instances and the test times are available from the companion Web site.

Each bin $B_i$ stores its elements in a balanced Binary Search Tree (BST), plus the nearest
neighbors above and below that set. Hence, searching for an element key proceeds by interpolating
on key to find which $B_i$ it may lie in, i.e.,

$$i = \left\lfloor n \cdot \frac{key - x_1}{x_n - x_1} \right\rfloor$$


then performing a search in the BST associated to $B_i$. For their solution, Demaine et al. prove
the following results:

• the worst-case search time is $O(\log n)$ and thus $O(\log \min\{\Delta, n\})$, and

• the algorithm reproduces the $O(\log\log n)$ performance of interpolation search on data
drawn independently from the uniform distribution.
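
The bin-based scheme can be illustrated with a static Python sketch. It is a simplification rather than Demaine et al.'s data structure: each bin is a plain sorted list searched with `bisect` instead of a balanced BST, the nearest-neighbor links and the dynamic updates are left out, and integer keys with at least two distinct values are assumed. The names `smoothness`, `build_bins` and `bin_search` are hypothetical.

```python
from bisect import bisect_left


def smoothness(xs):
    """Ratio between the largest and the smallest gap of the sorted list xs."""
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return max(gaps) / min(gaps)


def bin_index(xs, key):
    """Interpolate on key to pick one of len(xs) equal-width bins (0-based)."""
    n, lo, hi = len(xs), xs[0], xs[-1]
    # The largest value would land in bin n, so clamp it into the last bin.
    return min(n - 1, n * (key - lo) // (hi - lo))


def build_bins(xs):
    """Distribute the sorted values over n bins; each bin stays sorted."""
    bins = [[] for _ in xs]
    for x in xs:
        bins[bin_index(xs, x)].append(x)
    return bins


def bin_search(xs, bins, key):
    """Membership test: jump to the bin of key, then search inside that bin."""
    if key < xs[0] or key > xs[-1]:
        return False
    b = bins[bin_index(xs, key)]
    j = bisect_left(b, key)
    return j < len(b) and b[j] == key
```

With `xs = [1, 2, 4, 8, 16, 32]` the smoothness is 16.0 and the first bin receives three of the six values, which shows how a large ratio concentrates elements in few bins and shifts the work to the search inside the bin.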

4. Experimental validation 

We have implemented AS, along with the other algorithms mentioned so far, in order to test
its efficiency on real data vis-a-vis those in the literature. The testing platform consists of a
Java implementation running on a PC with JRE 1.7, Windows 2003 Server R2, and a dual Opteron CPU
with 4 GB of RAM. The tests consisted of running a number of searches corresponding to 1/1000 of
the size of the dataset; keys were randomly chosen, with at least 80% of them successful. The
results were normalized w.r.t. the number of queries.

4.1. Validation across distributions 

As a first step, we considered randomly generated benchmark instances (ordered arrays) of the
Java double data type, i.e., double-precision 64-bit IEEE 754 floating point values. Instances
were randomly generated, with the following distribution types:

1. uniform sparsity: the gap between two consecutive values is fixed across the instance. As
unrealistic as it is, this case is useful in assessing whether AS introduces overheads.

2. increasing sparsity: the gap keeps growing, so the elements towards the end (i.e., the
highest integer values) are more distant from each other than those at the beginning.

3. stepwise sparsity: the instance has zones with distinct, but fixed, gap sizes; the gap size
grows towards the end of the array.

4. Paretian: the "80-20" rule applied to the summation of the values inside the instance, i.e.,
the summation of the first 80% of elements is equal to the summation of the last 20%.
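
The first three instance types can be sketched with small Python generators, together with a helper that measures the property used to define the Paretian type. The paper does not give the generation parameters, so the gap sizes, the growth rate, the number of zones and the function names below are all assumptions made for illustration; the authors' Java generators may differ.

```python
def uniform_sparsity(n, gap=3.0, start=0.0):
    """n values with the same gap between every two consecutive values."""
    return [start + i * gap for i in range(n)]


def increasing_sparsity(n, first_gap=1.0, growth=1.001, start=0.0):
    """n values whose gap is multiplied by growth after every element."""
    values, x, gap = [], start, first_gap
    for _ in range(n):
        values.append(x)
        x += gap
        gap *= growth
    return values


def stepwise_sparsity(n, zones=4, first_gap=1.0, factor=10.0, start=0.0):
    """n values split into zones; the gap is fixed inside a zone and grows
    by factor from one zone to the next."""
    values, x = [], start
    for i in range(n):
        values.append(x)
        zone = i * zones // n
        x += first_gap * factor ** zone
    return values


def tail_share(values, tail_fraction=0.2):
    """Share of the total sum carried by the last tail_fraction of the values.

    Under the definition in item 4, a Paretian instance has tail_share 0.5.
    """
    k = max(1, int(len(values) * tail_fraction))
    return sum(values[-k:]) / sum(values)
```

For instance, `stepwise_sparsity(8, zones=2)` yields `[0.0, 1.0, 2.0, 3.0, 4.0, 14.0, 24.0, 34.0]`: unit gaps in the first zone and gaps of ten in the second.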

For each parameter setting we generated and tested 10 random instances, then computed the
average. Also, values are normalized w.r.t. the number of queries, so as to make them comparable
across instance sizes. The results, presented in Table 1, compare the number of accesses,
iterations and times of the four algorithms we considered.

                        no. of Accesses      no. of Iterations              Times
Sparsity     Algo.     10^[…]    10^[…]      10^[…]    10^[…]        10^[…]        10^[…]

uniform      BS        14,728    18,467      15,790    19,155     4,074,941       756,956
             IS         4,743     4,919       2,888     3,068     2,553,635       700,328
             IBS2      19,644    23,397      26,922    33,613    25,273,348     3,639,200
             AS         6,054     6,290       2,887     3,065     3,150,028     1,004,239

increasing   BS        14,741    18,479      15,828    19,156       462,659       831,015
             IS        19,613    26,619      18,906    25,978       948,907     2,994,142
             IBS2      19,338    23,502      22,581    29,177     1,345,126     3,499,578
             AS        11,198    12,160       5,460     6,016       596,080     1,744,790

stepwise     BS        14,795    18,505      15,957    19,171       445,794       753,501
             IS       232,945   329,222     256,465   351,515    10,386,056    35,041,154
             IBS2      20,304    24,202      24,111    31,096     1,485,665     3,604,115
             AS        12,055    12,968       6,129     6,708       652,009     1,505,453

Paretian     BS        14,793    18,476      15,917    19,157       457,496       916,074
             IS        17,536    21,702      16,028    20,209       839,989     2,768,519
             IBS2      20,252    24,180      25,339    31,904     1,509,791     3,900,253
             AS        10,338    11,003       5,097     5,536       564,157     1,516,632

Table 1: Averaged and normalized benchmark values over random instances with distinct data distributions. Times are
in milliseconds.


In aggregate, AS outperformed IBS2 as well, although the difference was sometimes statistically
insignificant. The distributions were designed to stress-test AS in an unfavorable setting,
where quicker implementations of BS could easily make up for the extra number of iterations.
Even though on uniform- and increasing-sparsity instances Binary search can still run slightly
faster than AS, in aggregate AS yields a clear advantage over all other algorithms, especially
in terms of number of accesses and iterations.

The full benchmark results and the source code (in Java) will be made available on a dedicated
Web site⁵.

4.2. A real-life benchmark dataset 

To perform our analysis on real-life data with mixed or alternating distributions we used a
public dataset on Facebook friendship released by Gjoka et al. [6]; it contains a graph of about
957 thousand vertices (each representing a user) and 58.4 million edges (each representing a
friendship relation). Since each user is identified by a unique integer, and the dataset is ordered
by user-id, it represents an ideal benchmark for testing our Adaptive search algorithm, as it gives
us one instance of about 1 million ordered integers. Also, the dataset can be split up into 9
distinct subinstances of 100k elements each. The collected user-ids depend on several factors and
on human intervention, e.g., users leaving Facebook and thus having their ids removed, so
subinstances turn out to have distinct data distributions. On top of that, the gaps between two
consecutive user-ids depend also on how the sample was collected, as discussed, e.g., in [6]. In
other words, this real-world dataset arguably summarizes the bias introduced by user activity and
web data collection.

⁵http://informatica.unime.it/adaptive-search/

To confirm these intuitions, we performed a statistical analysis of the distribution of the gaps
w.r.t. a null model, i.e., a random instance that we generated with the same size, the same range
of values and with gaps having the same average and standard deviation as the real dataset. How
would the gaps (and therefore the user-ids) in the Facebook dataset relate to their randomized
version? The resulting null model instance turned out to have a very different distribution of
the gaps. For the whole dataset we found a Spearman's rank correlation coefficient equal to
4.95 · 10^[…]; Pearson's correlation coefficient was also very low, at 1.66 · 10^[…]. This
indicates that the FB dataset is non-uniform.

We used the same platform and the same set-up as before for the testing; the first test considered
the whole of Gjoka's dataset, and the aggregated results (averaged over 10 runs) are in Table 2.
As for the synthetic benchmarks, we ran a number of searches corresponding to 1/1000 of the
size of the dataset; keys were randomly chosen, with at least 80% of them successful.


Algorithm    Accesses    Iterations     Time (ms)

BS             18,439        19,136     6,236,787
IS            501,346       499,730    74,035,808
IBS2           24,097        31,474    28,205,959
AS              8,349         4,044     4,791,845

Table 2: Benchmark values over Gjoka's dataset


Subsequently, we sought to confirm these results over similar datasets having diverse
value distributions. To do so, we repeated the test on 9 sub-instances of Gjoka's dataset, each
corresponding to 100k consecutive keys, i.e., positions (not values) 0-99,999, 100,000-199,999 and
so on. In fact, the $L_2$ (Euclidean) distance from a uniform distribution of the gaps between two
consecutive values varies widely. Nevertheless, our AS algorithm performed well on each subset,
as reported in Table 3.


Instance        1       2       3       4       5       6       7       8       9

IS          1,662   1,473   1,494   1,463   1,488   1,470   1,483   1,487   1,489
BS            621     522   3,623     300     300     300     300     300     300
IBS         2,177   2,028   1,951   2,044   2,043   2,051   2,057   2,047   2,068
AS            889     711     860     307     310     311     309     309     309

Table 3: Memory accesses over 9 subinstances of Gjoka's dataset


Instance no.    1       2       3       4       5       6       7       8       9

IS          1,783   1,567   1,613   1,570   1,602   1,578   1,589   1,593   1,605
BS            422     351   3,454     100     100     100     100     100     100
IBS         2,834   2,633   2,466   2,630   2,618   2,636   2,644   2,643   2,661
AS            417     355     453     100     100     100     100     100     100

Table 4: Iterations over 9 subinstances of Gjoka's dataset


Instance no.          1            2            3            4            5

IS            1,180,339      433,851      474,331      447,672      453,487
BS              682,630      252,233    1,862,590      126,926      127,868
IBS           5,104,407    2,600,980    2,533,918    2,723,071    2,523,096
AS              427,276      326,454      393,219      135,930      135,005

Instance no.          6            7            8            9          Sum

IS              453,342      481,395       71,925       70,068    4,066,410
BS              127,822      131,044      136,434      130,819    3,578,366
IBS           2,541,275    2,588,292      242,335      254,291   21,111,665
AS              134,669      136,642      141,063      140,683    1,970,941

Table 5: Times (in milliseconds) for search over 9 subinstances of Gjoka's dataset


5. Conclusions

Even though we have considered only the simplest instance of search, i.e., ordered sets of
integers, it turns out that this case is of great practical interest when we consider large datasets
extracted from, e.g., crawling Web pages or Online Social Networks, where users/resources are
identified by simple integer keys. This is notably the case with Facebook, which assigns to each
subscriber a user-id consisting of a progressive integer. On this type of data, our solution shows
a marked improvement over the literature. The results of the experiments described in the previous
section lead us to draw the following conclusions:

1. The performance of our AS algorithm vis-a-vis that of IS and BS is very good and
improves as n grows;

2. The number of accesses needed by AS is lower than that of BS. The cost analysis of IS
suggests that on certain instances, i.e., when sparsity grows, our algorithm needs between
$\log n$ and $2 \log n$ accesses.

3. Our method for selecting the search interval succeeds in preventing the irregularities of
the data distribution from affecting performance; indeed, the number of accesses required
remains $\approx \log\log n$.

4. While the asymptotic complexity of our AS algorithm is the same as that of Santoro's IBS, we
have found that, on relatively diverse benchmarks, AS often needs half or fewer of the
memory accesses of IBS.

5. Even though we could not yet run a complete study on large datasets, we have indications
that the results presented here are likely to be confirmed for search dictionaries (considered
by [3]).

An interesting open question is whether instances that elicit the worst case (2 log n comparisons) 
for AS can actually be found, and how likely they are to appear within real datasets. 